
Initial work for file format writer API #3119

Open
nssalian wants to merge 5 commits into apache:main from nssalian:file-format-initial-work

Conversation

@nssalian (Contributor) commented Mar 3, 2026

Initial work for #3100. Since this is a large change, it is being done in parts (similar to the AuthManager work) so it is easier to review and to move the existing code around.

Rationale for this change

Introduces the pluggable file format writer API: FileFormatWriter, FileFormatModel, and FileFormatFactory in pyiceberg/io/fileformat.py. Moves DataFileStatistics out of pyarrow.py, leaving a re-export for backward compatibility. The move is forward-looking: the idea is to keep the statistics generic as additional formats are added.

This is the first part of the work for #3100. No behavioral changes; the write path remains hardcoded to Parquet.
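To make the shape of the proposed API concrete, here is a minimal sketch of how the three classes could fit together. The class names (FileFormatWriter, FileFormatModel, FileFormatFactory) come from the PR description, but every method signature below is an assumption for illustration; the `table` parameter would be a `pa.Table` in the real API.

```python
# Hypothetical sketch of the pluggable writer API; names are from the PR,
# signatures are assumptions.
from abc import ABC, abstractmethod


class FileFormatWriter(ABC):
    """Writes data to a single file of one format (e.g. Parquet)."""

    @abstractmethod
    def write(self, table: object) -> None:  # pa.Table in the real API
        ...


class FileFormatModel(ABC):
    """Creates writers for one file format."""

    @abstractmethod
    def new_writer(self, output_path: str) -> FileFormatWriter:
        ...


class FileFormatFactory:
    """Maps a file format to its registered FileFormatModel."""

    _registry: dict[str, FileFormatModel] = {}

    @classmethod
    def register(cls, file_format: str, model: FileFormatModel) -> None:
        cls._registry[file_format] = model

    @classmethod
    def get(cls, file_format: str) -> FileFormatModel:
        if file_format not in cls._registry:
            raise ValueError(
                f"No writer registered for {file_format}. "
                f"Available: {list(cls._registry.keys())}"
            )
        return cls._registry[file_format]
```

The factory-with-registry split mirrors Java's FormatModelRegistry: the write path asks the factory for a model by format, so adding a format is a registration call rather than a change to the write path itself.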

Are these changes tested?

Yes. tests/io/test_fileformat.py tests the backward-compatible import of DataFileStatistics.

Are there any user-facing changes?

No

@nssalian nssalian marked this pull request as ready for review March 3, 2026 19:01
@nssalian (Contributor, Author) commented Mar 6, 2026

CC: @kevinjqliu @Fokko @geruh for review

OutputFile,
OutputStream,
)
from pyiceberg.io.fileformat import DataFileStatistics as DataFileStatistics
Contributor:
Suggested change:
- from pyiceberg.io.fileformat import DataFileStatistics as DataFileStatistics
+ from pyiceberg.io.fileformat import DataFileStatistics

_result: DataFileStatistics | None = None

@abstractmethod
def write(self, table: pa.Table) -> None:
Contributor:

A table looks to be the logical starting point, but I think an iterator of RecordBatches would also make sense. WDYT @kevinjqliu
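The batch-oriented signature suggested above could be sketched as follows. This is an illustration only: `BatchWriter` and `write_table` are invented names, and a batch is stubbed here as a plain list of row dicts where the real API would use `pa.RecordBatch` (a `pa.Table` can be decomposed into batches with `Table.to_batches()`).

```python
# Sketch of a writer that consumes an iterator of batches rather than a
# whole table; "batch" is stubbed as a list of rows for the example.
from collections.abc import Iterator


class BatchWriter:
    def __init__(self) -> None:
        self.rows_written = 0

    def write_batches(self, batches: Iterator[list[dict]]) -> None:
        # Consuming an iterator bounds peak memory to one batch at a time,
        # which a whole-table signature cannot guarantee.
        for batch in batches:
            self.rows_written += len(batch)


def write_table(writer: BatchWriter, table: list[dict], batch_size: int = 2) -> None:
    # A table-based convenience wrapper can still sit on top of the batch API.
    writer.write_batches(
        table[i : i + batch_size] for i in range(0, len(table), batch_size)
    )
```

The design point is that the batch signature is the more general one: a table write is trivially expressed as a batch write, but not the reverse.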

def partition(self, partition_spec: PartitionSpec, schema: Schema) -> Record:
return Record(*[self._partition_value(field, schema) for field in partition_spec.fields])

def to_serialized_dict(self) -> dict[str, Any]:
Contributor:

Might be nice to change this into a TypedDict as a return type
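For illustration, the TypedDict suggestion could look like the sketch below. The field names (`partition`, `spec_id`) are invented for the example and are not the real serialized keys; the point is only that a TypedDict lets static checkers verify key names and value types at call sites, which a bare `dict[str, Any]` return cannot.

```python
# Hypothetical TypedDict return type; field names are invented for the example.
from typing import TypedDict


class SerializedPartitionKey(TypedDict):
    partition: dict[str, object]
    spec_id: int


def to_serialized_dict(partition: dict[str, object], spec_id: int) -> SerializedPartitionKey:
    # mypy/pyright will flag a misspelled key or a wrongly typed value here
    # and at every call site that indexes into the result.
    return {"partition": partition, "spec_id": spec_id}
```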

Contributor Author:

I moved it over from the original implementation. I can switch to a TypedDict in a follow-up when I wire it through, if that works?

Comment on lines +175 to +178
def get(cls, file_format: FileFormat) -> FileFormatModel:
if file_format not in cls._registry:
raise ValueError(f"No writer registered for {file_format}. Available: {list(cls._registry.keys())}")
return cls._registry[file_format]
Contributor:

I think PyIceberg diverges a bit from Java on this point. PyIceberg could have multiple implementations for Parquet, for example (Arrow/fsspec). Maybe we want something similar to the FileIO loading:

SCHEMA_TO_FILE_IO: dict[str, list[str]] = {

Contributor Author:

I implemented the FileFormatFactory as the Python equivalent of Java's FormatModelRegistry, keyed by FileFormat alone since Python only has Arrow (vs Java needing (FileFormat, Class<?>) for Spark/Flink/Generic). Let me know if you think it's worth adding a property-based override.
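A property-based override along the lines of the FileIO loading discussed above could look roughly like this. Everything here is an assumption for illustration: `FORMAT_TO_WRITERS` and `load_writer` are invented names modeled on the `SCHEMA_TO_FILE_IO` preference-list pattern, and the `"pyarrow"`/`"fsspec"` entries are placeholders.

```python
# Sketch of a preference-list registry with an explicit override, loosely
# modeled on FileIO loading; names and entries are assumptions.
from __future__ import annotations

FORMAT_TO_WRITERS: dict[str, list[str]] = {
    # Several implementations may exist per format; the first available wins.
    "PARQUET": ["pyarrow", "fsspec"],
}

# Stand-in for "which implementations are actually importable".
AVAILABLE: dict[str, object] = {"pyarrow": object()}


def load_writer(file_format: str, override: str | None = None) -> object:
    """Pick a writer implementation, honoring an explicit override first."""
    candidates = [override] if override else FORMAT_TO_WRITERS.get(file_format, [])
    for name in candidates:
        if name in AVAILABLE:
            return AVAILABLE[name]
    raise ValueError(f"No writer available for {file_format}")
```

This keeps the simple keyed-by-FileFormat lookup as the default while leaving room for a per-table or per-catalog property (the override) to select among implementations later.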

@nssalian nssalian requested a review from Fokko March 27, 2026 15:09
